When Coding Style Survives Compilation: De-anonymizing Programmers from Executable Binaries

نویسندگان

  • Aylin Caliskan
  • Fabian Yamaguchi
  • Edwin Dauber
  • Richard E. Harang
  • Konrad Rieck
  • Rachel Greenstadt
  • Arvind Narayanan
چکیده

The ability to identify authors of computer programs based on their coding style is a direct threat to the privacy and anonymity of programmers. Previous work has examined attribution of authors from both source code and compiled binaries, and found that while source code can be attributed with very high accuracy, the attribution of executable binary appears to be much more difficult. Many potentially distinguishing features present in source code, e.g. variable names, are removed in the compilation process, and compiler optimization may alter the structure of a program, further obscuring features that are known to be useful in determining authorship. We examine executable binary authorship attribution from the standpoint of machine learning, using a novel set of features that include ones obtained by decompiling the executable binary to source code. We show that many syntactical features present in source code do in fact survive compilation and can be recovered from decompiled executable binary. This allows us to add a powerful set of techniques from the domain of source code authorship attribution to the existing ones used for binaries, resulting in significant improvements to accuracy and scalability. We demonstrate this improvement on data from the Google Code Jam, obtaining attribution accuracy of up to 92% with 100 candidate programmers. We also demonstrate that our approach is robust to basic obfuscations, a range of compiler optimization settings, and binaries that have been stripped of their symbol tables. Finally, for the first time we are aware of, we demonstrate that authorship attribution can be performed on both obfuscated binaries, and real world code found “in the wild” by performing attribution on single-author GitHub repos-

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

De-anonymizing Programmers via Code Stylometry

Source code authorship attribution is a significant privacy threat to anonymous code contributors. However, it may also enable attribution of successful attacks from code left behind on an infected system, or aid in resolving copyright, copyleft, and plagiarism issues in the programming fields. In this work, we investigate machine learning methods to de-anonymize source code authors of C/C++ us...

متن کامل

Poster: Atoms of Style: Identifying the Authors of Program Binaries

Being able to identify the author of a program has many applications in both academic and commercial environments. In most use cases, the source code is readily available, and this is reflected in the literature, as previous work has mostly focused on source code analyses. In contrast, scant research has been carried out on identifying the authors of executable program binaries. This would be m...

متن کامل

On the Feasibility of Malware Authorship Attribution

There are many occasions in which the security community is interested to discover the authorship of malware binaries, either for digital forensics analysis of malware corpora or for thwarting live threats of malware invasion. Such a discovery of authorship might be possible due to stylistic features inherent to software codes written by human programmers. Existing studies of authorship attribu...

متن کامل

Who Wrote This Code? Identifying the Authors of Program Binaries

Program authorship attribution—identifying a programmer based on stylistic characteristics of code—has practical implications for detecting software theft, digital forensics, and malware analysis. Authorship attribution is challenging in these domains where usually only binary code is available; existing source code-based approaches to attribution have left unclear whether and to what extent pr...

متن کامل

Type-preserving Compilation for End-to-end Verification of Security Enforcement

A number of programming languages use rich type systems to verify security properties of code. Some of these languages are meant for source programming, but programs written in these languages are compiled without explicit security proofs, limiting their utility in settings where proofs are necessary, e.g., proof-carrying authorization. Others languages do include explicit proofs, but these are...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1512.08546  شماره 

صفحات  -

تاریخ انتشار 2015